{ "cells": [ { "cell_type": "markdown", "id": "52d05953", "metadata": {}, "source": [ "# Tutorial 1 - predefined regions\n", "\n", "Demo data for the tutorial can be downloaded from [Zenodo](https://doi.org/10.5281/zenodo.16740333)" ] }, { "cell_type": "code", "execution_count": 1, "id": "97657ff4-1116-4a48-947c-f9d2c9df5064", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import SplIsoFind" ] }, { "cell_type": "markdown", "id": "7dd9f0a6-55c8-489c-b5f9-6d7dab54da33", "metadata": {}, "source": [ "## Process allinfo file\n", "Assign a cell type and brain region to every read using the allinfo_addct() function from the preprocess_scisorseqr.py script. The function takes the allinfo file produced by IsoQuant, the SR adata file, and the CIDmap file. The labels in the SR adata file give the cell type and region of each cell, and the CIDmap file indicates which barcodes belong to which cell ID. \n", "\n", "Each cell is labeled as CT_REGION (cell type followed by region, e.g. ExciteNeuron_CA2). \n" ] }, { "cell_type": "code", "execution_count": 2, "id": "29bacdef-a1f8-4558-8ef5-d1e2909d0494", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of reads:\n", "446520\n", "Number of reads with at least 1 exon in intron chain:\n", "372190\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b81532aa9122400397696d88499c2ce2", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/372190 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Number of reads overlapping segmented cells:\n", "226461\n", "Number of reads with label:\n", "162445\n" ] } ], "source": [ "fn_allinfo = 'data/allinfo_ds.gz' \n", "fn_CIDmap = 'data/sample1_barcodeToPos.CellID_ds.tsv.gz' \n", "fn_adata = 'data/sample1_cellbin_adjusted.h5ad' \n", "SplIsoFind.pp.allinfo_addct(fn_allinfo, fn_CIDmap, fn_adata) " ] }, { "cell_type": "markdown", "id": 
"d78cc912-2965-4770-a454-f668deeebabf", "metadata": {}, "source": [ "## Create auxiliary files\n", "\n", "scisorseqr needs the following files for every dataset:\n", " - Iso-IsoID.csv: assigns an ID to every isoform\n", " - NumIsoPerCluster: indicates how often each isoform was counted per cell type" ] }, { "cell_type": "code", "execution_count": 3, "id": "b0eb5602-1cd7-4bf8-adac-98bd32739955", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of reads:\n", "162445\n", "Number of reads with isoform assigned:\n", "105769\n" ] } ], "source": [ "fn_allinfo = 'data/allinfo_ds.filtered.labeled.gz' \n", "output_dir = 'data/scisorseqr/demo/'\n", "SplIsoFind.pp.create_auxiliary_files(fn_allinfo, output_dir)\n" ] }, { "cell_type": "markdown", "id": "ad54bbc9-1025-4732-8a9c-14fb97d645ab", "metadata": {}, "source": [ "## Running scisorseqr\n", "\n", "We run scisorseqr for different groups of cells (either all cells or a specific cell type) and brain regions (e.g. the broad brain regions, cortical layers, or hippocampal subregions). For every test (e.g. all cells in the hippocampus), we created a celltype file that defines the pairwise tests scisorseqr has to perform. Each file contains four columns:\n", "1. Name of the first group (e.g. CA1_ML)\n", "2. Labels of the reads considered to be in this group (e.g. other_CA1_ML,ExciteNeuron_CA1_ML,InhibNeuron_CA1_ML,Astro_CA1_ML,Oligo_CA1_ML)\n", "3. Name of the second group (e.g. CA2)\n", "4. 
Labels of the reads considered to be in this group (e.g. ExciteNeuron_CA2,other_CA2,InhibNeuron_CA2,Astro_CA2)\n", "\n", "For more information on scisorseqr and installation instructions, see the original GitHub repository (https://github.com/tilgnerlab/scisorseqr).\n", "\n", "The example code below can be run in the terminal to apply scisorseqr to all celltype files:\n", "\n", "```bash\n", "cd 'Demo/Data/scisorseqr/demo/'\n", "mkdir res_scisorseqr\n", "cd res_scisorseqr\n", "\n", "# Loop over all regions and celltypes\n", "files=$(ls ../../ct_files)\n", "for file in $files; do\n", " echo \"Filename: $file\"\n", " mkdir $file\n", " cd $file\n", " cp ../../../ct_files/$file .\n", "\n", " # Copy files to the IsoQuantOutput folder since this is what scisorseqr automatically uses as input\n", " mkdir IsoQuantOutput\n", " cp ../../Iso-IsoID.csv IsoQuantOutput/\n", " cp ../../NumIsoPerCluster IsoQuantOutput/ \n", "\n", " # Run scisorseqr\n", " Rscript -e 'library(scisorseqr); DiffSplicingAnalysis(\"'\"$file\"'\")'\n", " echo \"Finished processing $file\"\n", " echo \n", " echo\n", "\n", " cd ../\n", "done\n", "```\n", "\n", "NOTE: Because the downsampled allinfo file is small, some comparisons might not work." ] }, { "cell_type": "markdown", "id": "9d456781-e037-42a9-9cbd-f064b851a2f3", "metadata": {}, "source": [ "## Plot results in a heatmap\n", "\n", "See some example plots below. " ] }, { "cell_type": "code", "execution_count": 4, "id": "08914927-1d8c-4de1-9995-37f8c03bc089", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ec4fe453-a962-419a-b961-34de928d482e_34_1753_+ | ENSMUSG00000025903.15 | other_Midbrain | ATACAGCTTGAACTGCATTCCGCGG | TGCCTCCCAC | ;%;chr1_4878206_4878677_+;%;chr1_4878710_48988... | NoTSS | NoPolyA | ;%;chr1_4878052_4878205_+;%;chr1_4878678_48787... | known | 8 | ENSMUST00000027036.11 | protein_coding |
| 1daab6bf-00b9-46b1-ab4c-3e38c43ce7c5_31_2557_+ | ENSMUSG00000025903.15 | other_Midbrain | AGTACGTGACCCAGGGTTGTCGTAG | TACCGGTCCA | ;%;chr1_4878206_4878677_+;%;chr1_4878710_48988... | NoTSS | chr1_4916963_4916963_+ | ;%;chr1_4878121_4878205_+;%;chr1_4878678_48787... | known | 8 | ENSMUST00000027036.11 | protein_coding |
| 5e9a5ec1-4c11-4f4c-9fde-b22b8b4ef75e_0_2538_- | ENSMUSG00000025903.15 | ExciteNeuron_CA3_ML | CTGGAAGTACTGCCTAAGACACAAG | CTAAGAGGGA | ;%;chr1_4878206_4878677_+;%;chr1_4878710_48988... | NoTSS | chr1_4916963_4916963_+ | ;%;chr1_4878132_4878205_+;%;chr1_4878678_48787... | known | 8 | ENSMUST00000027036.11 | protein_coding |
| f227cc3d-3334-4029-a137-4c13317dd618_34_2230_+ | ENSMUSG00000025903.15 | other_L4 | GATCTATGTCTTACCACTTTAAACG | TCGAAACTGC | ;%;chr1_4907298_4909609_+;%;chr1_4909712_49111... | NoTSS | chr1_4916963_4916963_+ | ;%;chr1_4907278_4907297_+;%;chr1_4909610_49097... | known | 3 | ENSMUST00000027036.11 | protein_coding |
| de5497de-77cd-4d78-911f-aa18af555008_10_2065_- | ENSMUSG00000025903.15 | ExciteNeuron_CA1_ML | ATGAGCGACTATGCGGTGGCTGAGC | TCCTCGCATCG | ;%;chr1_4911356_4915185_+ | NoTSS | chr1_4916963_4916963_+ | ;%;chr1_4911187_4911355_+;%;chr1_4915186_49169... | known | 1 | ENSMUST00000027036.11 | protein_coding |